Friedrich Wilhelm Schröer
Fraunhofer Institute for Computer Architecture and Software Technology
DRAFT
ACCENT detects at parsing time whether a given text is ambiguous. It would be desirable to get an early indication of what parts of a grammar could lead to ambiguities.
In general it is undecidable whether a given grammar is ambiguous. But if a given grammar is ambiguous this can be detected by enumerating and checking the token strings of a language. If such an algorithm presents a text with two different parsing trees we know that the grammar is ambiguous. But if the grammar is unambiguous the algorithm may not terminate.
AMBER is a tool that systematically generates (prefixes of) example strings of a given grammar and checks them for ambiguity. Because this is done using an extremely efficient algorithm it is realistic to check millions of such examples in short time. Whenever two examples have a common prefix the prefix is inspected only once.
Hence one has a good chance to detect a problem. Nevertheless, the user should be aware of the fact that the search space in general is infinite and that the number of examples grows exponentially with their length. AMBER has a number of options to influence the search. For example, the user can choose to inspect all examples up to a given length or a randomly selected subset allowing examples of greater length in reasonable time.
If spec.acc is a grammar in the ACCENT grammar language, then the command
(In a future version AMBER will directly process the grammar in order to avoid the compilation step if the grammar has changed.)
Example
For example, when used with the ANSI C grammar, the command
Progress information is displayed on stderr.
When a rule is processed we use a "dot" (denoted by "*") to indicate the actual position inside the rule. For example, in
N : M_1 ... * M_i ... M_nthe next symbol being to be processed is M_i. Such a "dotted rule" is called an item.
An item has also a "back-pointer" to find items that triggered the actual one (I do not discuss this here).
The algorithm constructs an item list for each input token.
The kernel of the item list for a particular input token is constructed by a step called the scanner.
If 't' is the current input token then all items of the previous list that have the form
M : ... * 't' ...are placed into the next item list where the dot is advanced behind the token
M : ... 't' * ...This indicates that 't' has been recognized and the symbol following it has to be processed.
If the dot appears in front of a nonterm, the "predictor" is invoked. It inserts items that start the processing of the member.
If the item has the form
M : ... * N ...and there is a rule
N : Alphathen an item
N : * Alphais inserted.
If the dot appears at the end of a rule, the "completer" is invoked. It takes the item that caused the processing of this rule and puts the dot after the corresponding nonterminal.
If the item has the form
N : ... *and this item was initially triggered by an item
M : ... * N ...then an item
M : ... N * ...is added, indicating that the member N has been processed.
Processing starts with the item
YYSTART : * S YYEOFwhere S is the start symbol of the grammar. The closure of this item determines the initial item list.
When the algorithm has processed i tokens it has constructed i item lists that contain all information to parse all continuations of the token list. The last item list has items of form
M : ... * 't' ...that will be processed by the Scanner to construct the kernel of the next item list. All tokens 't' in those items are valid continuations of the current token string.
We collect these in a list of valid tokens and treat each separately as if it would have been the next source token and construct the next item list. This is embedded into a recursive procedure that extends a given token string of length n :
extend (n)
{
if (search ends at n) return;
l = list of valid tokens;
for (each s in l)
{
let s be the next token;
next_item_list();
extend(n+1);
}
}
Using this approach,
only valid token sequences are considered.
Instead of parsing each example separately
and from the beginning, examples with common prefixes
are parsed together where the prefix is processed only once.
The technique used to detect ambiguities is taken from ACCENT[2].
| [1] |
Earley, J.: An Efficient Context-Free Parsing Algorithm Communications of the ACM Volume 13, Number 2, February 1970, pp. 94-102 |
| [2] |
Schröer, F.W.: The ACCENT Compiler Compiler GMD Report 101 German National Research Center for Information Technology July 2000 |